Factor investing is a security selection approach that targets specific elements, or factors, that appear to drive the returns of an asset. The premise behind factor investing is that, using quantitative methods, various risk premia can be identified, isolated, and exploited to the investor's advantage. This idea is the cornerstone of almost all quantitative portfolio management methods. The impact factor investing has had on the financial industry is profound, so much so that the world's largest asset manager (BlackRock) has shifted massive amounts of assets away from "old school" fundamental managers to purely quantitative strategies, and mandated that the fundamental managers who remained adopt more data-driven processes. Factor investing is not a new idea, but recent developments in computing and mathematics (machine learning) allow practitioners to approach it from an entirely different angle.
The model makes the following underlying assumptions:
1) Predicting future returns is often considered a fool's errand by professionals, because too many factors can impact the price of a given security at any given moment to account for them all accurately. Furthermore, the impact and importance of these factors are constantly changing. Despite these challenges, most practitioners continue to try.
2) Selecting the stocks with the highest probability of a desired outcome based on common factors will lead to a more beneficial outcome than selecting stocks based on an expected (predicted) return, assuming accurate predictions are not sustainable over longer periods of time.
3) It is at least as important to avoid bad stocks as it is to invest in good ones. Selecting good stocks helps generate returns; avoiding bad stocks helps minimize losses.
The aforementioned assumptions make clear that there could be significant benefits to framing security selection as a classification problem rather than a regression problem. To solve such a problem, we propose a multilayer perceptron that classifies stocks into one of two categories (binary classification): 1) at or above the median return of all stocks over the next year, or 2) below the median return over the next year.
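This labeling scheme can be illustrated with a small, hypothetical example (the tickers and returns below are made up for demonstration):

```python
import pandas as pd

# Hypothetical toy data illustrating the labeling scheme described above:
# a stock gets label 1 if its forward return is at or above the
# cross-sectional median, and 0 otherwise.
fwd_ret = pd.Series({"AAA": 0.12, "BBB": -0.03, "CCC": 0.25, "DDD": 0.04},
                    name="1yr_ret")
labels = (fwd_ret >= fwd_ret.median()).astype(int)
# the median is 0.08, so AAA and CCC are labeled 1; BBB and DDD are labeled 0
```

Because the median is computed cross-sectionally, the two classes are balanced by construction.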
Neural networks offer some advantages for this task over other supervised classification algorithms. First, MLPs can model nonlinear relationships. Second, traditional factor investing processes require that we identify and manipulate features on our own, whereas a neural network lets us attack the problem directly with the data. Lastly, neural networks are generally better at dealing with noisy data, a very common issue in finance.
The data in question is panel data, which is in a class of its own. Simply put, panel data is not time series data, and it is not cross-sectional data either; in fact, it is both. The multidimensional nature of panel data adds a clear layer of complexity to the choice of model. On one hand, the time series aspect of the data suggests that RNN or LSTM models would be better suited to the task. On the other hand, the tabular nature of the data suggests that an MLP may be the better fit. In the end, given that our objective was to classify rather than predict a future price, the time element seemed secondary. We therefore chose to treat the data as cross-sectional and to minimize the effect of time, making the multilayer perceptron the best choice for the task as framed.
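Concretely, treating the panel as cross-sectional means each (Instrument, date) pair becomes one independent observation. A small, hypothetical sketch (made-up tickers and values):

```python
import pandas as pd

# Hypothetical illustration of pooling panel data into a flat cross-section:
# each (Instrument, date) pair is one independent row, so ordering in time
# no longer matters to the model.
panel = pd.DataFrame({
    "Instrument": ["AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(["2019-03-29", "2019-06-28"] * 2),
    "p_e": [15.2, 16.1, 22.4, 21.0],
    "mom_1yr": [0.08, 0.12, -0.02, 0.03],
})
# Once time is treated as incidental, shuffling rows is harmless
cross_section = panel.sample(frac=1, random_state=42).reset_index(drop=True)
```

The time dimension survives only implicitly, through lagged and rolling features such as the momentum column above.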
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.regularizers import l1, l2
from keras.callbacks import History, EarlyStopping
from keras.backend import clear_session, eval
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
import plotly.tools as tls
import pickle
import re
from threading import Event
from yahoo_fin import stock_info as si
from sqlalchemy import create_engine
import psycopg2
import io
import os
from psycopg2 import connect
from sklearn.preprocessing import LabelEncoder
from ipywidgets import widgets, VBox, HBox
import warnings
warnings.simplefilter("ignore")
import portfolio as prt
from numpy.random import seed
import tensorflow as tf
import port_2 as prt
import pandas_market_calendars as mcal
import financial_metrics as fm
import sys, importlib
from sklearn.linear_model import LinearRegression
import port_charts as pch
import feature_creation as fc
import plotly
plotly.offline.init_notebook_mode()
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import keras_tuner as kt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam
import plotly.io as pio
from IPython.display import Image
The model uses data from four different categories:
1) Accounting data – data taken directly from a company's financial statements (income statement, balance sheet, cash flow statement).
2) Trading data – data based on market activity over a given period of time.
3) Valuation data – generally market data normalized by accounting data.
4) Technical indicators – moving averages over various windows based on the price of the stock.
The data comes from three sources:
1) Refinitiv Eikon (API) – a subscription-based financial data provider offering access to thousands of corporate and financial data sets for companies and markets around the globe.
2) Yahoo Finance (API) – within the scope of this model, the Yahoo Finance API is used strictly for stock price history, for the sake of convenience.
3) Datastream Web Services – a subscription-based financial and economic data provider that also offers data sets for companies, countries, and markets around the world.
To limit the need for multiple calls and to minimize the use of local storage, the raw data from the aforementioned sources is stored in an SQL database and updated regularly from a machine with access to an Eikon Terminal, which is required for two of the three sources. The code for these calls is provided in a notebook titled Data Collection Five Factor. The functions below import all the required data from the database. THIS CAN TAKE UP TO ONE HOUR TO COMPLETE; do not run if unnecessary.
###SQL Engine
engine = fc.engine
###set random seeds
seed(42)
tf.random.set_seed(42)
def combine_inc_bal():
"""This function imports all accounting data sets (income statment, cashflow statment, balance sheet) and merges them for further use."""
with open("assets/models/jeff_multi_factor/spy_rics.pkl", "rb") as f:
rics = pickle.load(f)
conn = connect(dbname = '697_temp', user = 'postgres', host = '********', password = '********')
cur = conn.cursor()
query = "SELECT * FROM bal_sht"
df_balance = pd.read_sql_query(query,conn)
query = "SELECT * FROM inc_stat"
df_income = pd.read_sql_query(query,conn)
query = "SELECT * FROM cf_stat"
df_cf = pd.read_sql_query(query,conn)
inc_bal = pd.DataFrame()
for r in rics:
inc = df_income[df_income['Instrument'] == r]
bal = df_balance[df_balance['Instrument'] == r]
cf = df_cf[df_cf['Instrument'] == r]
inc_bal_1 = inc.merge(bal, on=['Instrument', "Original Announcement Date Time"])
inc_bal_1 = inc_bal_1.merge(cf, on=['Instrument', "Original Announcement Date Time"])
inc_bal_1 = inc_bal_1.rename(columns={'Original Announcement Date Time': 'date',
'EPS - Basic - incl Extraordinary Items, Common - Total': 'eps_basic',
'EPS - Basic - excl Extraordinary Items, Common - Total': 'eps_excl_basic',
'EPS - Diluted - excl Exord Items Applicable to Common Total': 'eps_dil',
'EPS Basic incl Exord, Common - Total, PoP Diff': 'eps_growth',
'Earnings before Interest & Taxes (EBIT)': 'ebit',
'Earnings before Interest Taxes Depreciation & Amortization': 'ebitda',
'Income Available to Common Shares': 'net_inc',
'Revenue from Business Activities - Total': 'rev',
'Total Assets': 'tot_assets',
'Other Assets - Total': 'oth_assets',
'Cash & Short Term Investments - Total': 'cash',
"Total Shareholders' Equity incl Minority Intr & Hybrid Debt": "book_val",
"Tangible Total Equity": "tang_book", "Debt - Total": "debt",
'Total Long Term Capital': "lt_cap",
'Intangible Assets - Total - Net': 'intang',
'Book Value per Share': 'bvps',
"Net Cash Flow from Operating Activities": "ocf",
'Capital Expenditures - Total': "capex",
'Cash Flow from Operations per Share': 'cfo_ps',
"Free Cash Flow to Equity": 'fcfe'})
inc_bal = pd.concat([inc_bal, inc_bal_1])
inc_bal['noa'] = inc_bal["tot_assets"] - inc_bal["oth_assets"]
inc_bal['ebit_bv'] = inc_bal['ebit'] / inc_bal["book_val"]
inc_bal['ebit_nonop'] = inc_bal['ebit'] / inc_bal["oth_assets"]
inc_bal['ebit_op'] = inc_bal['ebit'] / inc_bal['noa']
inc_bal['ebit_tot'] = inc_bal['ebit'] / inc_bal["tot_assets"]
inc_bal['ebit_mgn'] = inc_bal['ebit'] / inc_bal['rev']
inc_bal['net_debt'] = inc_bal['debt'] - inc_bal['cash']
inc_bal['nd_ebitda'] = inc_bal['net_debt'] / inc_bal['ebitda']
inc_bal['ni_op'] = inc_bal['net_inc'] / inc_bal['noa']
inc_bal['ni_tot'] = inc_bal['net_inc'] / inc_bal["tot_assets"]
inc_bal['ni_mgn'] = inc_bal['net_inc'] / inc_bal['rev']
inc_bal['ocf_bv'] = inc_bal['ocf'] / inc_bal["book_val"]
inc_bal['ocf_op'] = inc_bal['ocf'] / inc_bal['noa']
inc_bal['ocf_tot'] = inc_bal['ocf'] / inc_bal["tot_assets"]
inc_bal['ocf_mgn'] = inc_bal['ocf'] / inc_bal['rev']
inc_bal['ocf_ce'] = inc_bal['ocf'] / inc_bal['lt_cap']
inc_bal['fcf_bv'] = inc_bal['fcfe'] / inc_bal["book_val"]
inc_bal['fcf_op'] = inc_bal['fcfe'] / inc_bal['noa']
inc_bal['fcf_tot'] = inc_bal['fcfe'] / inc_bal["tot_assets"]
inc_bal['fcf_mgn'] = inc_bal['fcfe'] / inc_bal['rev']
inc_bal['fcf_ce'] = inc_bal['fcfe'] / inc_bal['lt_cap']
query = "SELECT * FROM qual_dat"
qual = pd.read_sql_query(query,conn)
qual = qual.rename(columns={"Original Announcement Date Time": 'date',
"Return on Average Common Equity - %, TTM": "roe",
"Return on Average Total Assets - %, TTM": "roa",
"Return on Average Total Long Term Capital - %, TTM": "roce",
"Return on Invested Capital - %, TTM": "roic",
"Total Debt Percentage of Total Equity": 'd_e'})
qual['date'] = pd.to_datetime(qual['date']).dt.date
inc_bal['date'] = pd.to_datetime(inc_bal['date']).dt.date
inc_bal = inc_bal.merge(qual, on=['date', 'Instrument'], how='outer')
return inc_bal
def mkt_cap_feat():
"""Caluclates average market cap over various timeframes """
conn = connect(dbname = '697_temp', user = 'postgres', host = '********', password = '********')
cur = conn.cursor()
query = "SELECT * FROM mkt_cap"
mkt_cap = pd.read_sql_query(query,conn)
mkt_cap = mkt_cap.set_index('Dates')
mkt_cap.columns = [re.findall(r"(?<=<)(.*?)(?=>)", i)[0] for i in mkt_cap.columns]
mkt_cap = mkt_cap * 1000000
mkt_cap12m = mkt_cap.rolling(252).mean().dropna(axis=0, how='all')
mkt_cap12m = mkt_cap12m.rename(columns={i: '{}_12m_avg_mktcap'.format(i) for i in mkt_cap12m.columns})
mkt_cap3m = mkt_cap.rolling(63).mean().dropna(axis=0, how='all')
mkt_cap3m = mkt_cap3m.rename(columns={i: '{}_3m_avg_mktcap'.format(i) for i in mkt_cap3m.columns})
mkt_cap6m = mkt_cap.rolling(126).mean().dropna(axis=0, how='all')
mkt_cap6m = mkt_cap6m.rename(columns={i: '{}_6m_avg_mktcap'.format(i) for i in mkt_cap6m.columns})
mkt_cap_features = mkt_cap12m.join(mkt_cap6m, how='inner')
mkt_cap_features = mkt_cap_features.join(mkt_cap3m, how='inner')
mkt_cap_features.index.name = "Date"
mkt_cap_features.to_csv('assets/models/jeff_multi_factor/mkt_cap_feats.csv')
return mkt_cap_features
def merge_vol_mkt():
"""combines market cap data and volume data"""
conn = connect(dbname = '697_temp', user = 'postgres', host = 'databasesec.cvhiyxfodl3e.us-east-2.rds.amazonaws.com', password = 'poRter!5067')
cur = conn.cursor()
query = "SELECT * FROM mkt_cap"
mkt_cap = pd.read_sql_query(query,conn)
mkt_cap = mkt_cap.set_index('Dates')
mkt_cap.columns = [re.findall(r"(?<=<)(.*?)(?=>)", i)[0] for i in mkt_cap.columns]
mkt_cap = mkt_cap * 1000000
mkt_cap12m = mkt_cap.rolling(252).mean().dropna(axis=0, how='all')
mkt_cap12m = mkt_cap12m.rename(columns={i: '{}_12m_avg_mktcap'.format(i) for i in mkt_cap12m.columns})
mkt_cap3m = mkt_cap.rolling(63).mean().dropna(axis=0, how='all')
mkt_cap3m = mkt_cap3m.rename(columns={i: '{}_3m_avg_mktcap'.format(i) for i in mkt_cap3m.columns})
mkt_cap6m = mkt_cap.rolling(126).mean().dropna(axis=0, how='all')
mkt_cap6m = mkt_cap6m.rename(columns={i: '{}_6m_avg_mktcap'.format(i) for i in mkt_cap6m.columns})
mkt_cap_features = mkt_cap12m.join(mkt_cap6m, how='inner')
mkt_cap_features = mkt_cap_features.join(mkt_cap3m, how='inner')
mkt_cap_features.index.name = "Date"
query = "SELECT * FROM vol_12"
vol_12 = pd.read_sql_query(query,conn)
query = "SELECT * FROM vol_6"
vol_6 = pd.read_sql_query(query,conn)
query = "SELECT * FROM vol_3"
vol_3 = pd.read_sql_query(query,conn)
vol = vol_12.set_index('Date').join(vol_6.set_index('Date'), how = 'outer')
vol = vol.join(vol_3.set_index('Date'), how = 'outer')
vol.index = pd.to_datetime(vol.index)
mkt_cap_features.index = pd.to_datetime(mkt_cap_features.index)
trading = mkt_cap_features.join(vol, how='inner')
trading.index.name = 'date'
trading.index = pd.to_datetime(trading.index).date
with open("assets/models/jeff_multi_factor/spy_rics.pkl", "rb") as f:
rics = pickle.load(f)
df = pd.DataFrame()
for r in rics:
try:
vol_12 = trading['{}_12m_volume'.format(r)].to_frame(name='12m_volume')
vol_6 = trading['{}_6m_volume'.format(r)].to_frame(name='6m_volume')
vol_3 = trading['{}_3m_volume'.format(r)].to_frame(name='3m_volume')
mkt_12 = trading['{}_12m_avg_mktcap'.format(r)].to_frame(name='12m_avg_mkt_cap')
mkt_6 = trading['{}_6m_avg_mktcap'.format(r)].to_frame(name='6m_avg_mkt_cap')
mkt_3 = trading['{}_3m_avg_mktcap'.format(r)].to_frame(name='3m_avg_mkt_cap')
trad = vol_12.join(vol_6, how='outer')
trad = trad.join(vol_3, how='outer')
trad = trad.join(mkt_12, how='outer')
trad = trad.join(mkt_3, how='outer')
trad = trad.join(mkt_6, how='outer')
trad = trad.sort_index(ascending=False)
trad['Instrument'] = r
df = pd.concat([df, trad])
except KeyError:
# skip instruments missing any of the expected columns
pass
return df
def creat_labs_vol():
"""Creates labels, volatility data, and momentum data"""
with open("assets/models/jeff_multi_factor/spy_rics.pkl", "rb") as f:
rics = pickle.load(f)
tics = [i.split('.')[0] for i in rics]
tics = ['BRK-B' if i == 'BRKb' else i for i in tics]
tics = ['BF-B' if i == 'BFb' else i for i in tics]
df = pd.DataFrame()
px_dat = pd.DataFrame()
timer = Event()
for s in tics:
try:
px = si.get_data(s)
px = px.rename(columns={"ticker": "Instrument"})
prices = px.filter(['Instrument', 'close'])
px_dat = pd.concat([px_dat, prices])
ret_1yr = px['adjclose'].pct_change(252).to_frame(name='1yr_ret').shift(-252)
ret_3m = px['adjclose'].pct_change(63).to_frame(name='3m_ret').shift(-63)
ret_6m = px['adjclose'].pct_change(126).to_frame(name='6m_ret').shift(-126)
vol_1yr = px['adjclose'].pct_change().rolling(252).std().to_frame(name='1yr_vol') * np.sqrt(252)
vol_3m = px['adjclose'].pct_change().rolling(63).std().to_frame(name='3mth_vol') * np.sqrt(252)
vol_6m = px['adjclose'].pct_change().rolling(126).std().to_frame(name="6mth_vol") * np.sqrt(252)
mom_1yr = px['adjclose'].pct_change(252).to_frame(name='1yr_mom')
mom_3m = px['adjclose'].pct_change(63).to_frame(name='3m_mom')
mom_6m = px['adjclose'].pct_change(126).to_frame(name='6m_mom')
px_based = ret_1yr.join(ret_3m, how='outer')
px_based = px_based.join(ret_6m, how='outer')
px_based = px_based.join(vol_1yr, how='outer')
px_based = px_based.join(vol_6m)
px_based = px_based.join(vol_3m)
px_based = px_based.join(mom_1yr)
px_based = px_based.join(mom_6m)
px_based = px_based.join(mom_3m)
px_based['Instrument'] = s
df = pd.concat([df, px_based])
print('{}: no {} of {} complete'.format(s, len(df['Instrument'].dropna().unique()), len(tics)))
except Exception:
# pause briefly, then retry once on a transient API failure
timer.wait(5)
px = si.get_data(s)
px = px.rename(columns={"ticker": "Instrument"})
prices = px.filter(['Instrument', 'close'])
px_dat = pd.concat([px_dat, prices])
ret_1yr = px['adjclose'].pct_change(252).to_frame(name='1yr_ret').shift(-252)
ret_3m = px['adjclose'].pct_change(63).to_frame(name='3m_ret').shift(-63)
ret_6m = px['adjclose'].pct_change(126).to_frame(name='6m_ret').shift(-126)
vol_1yr = px['adjclose'].pct_change().rolling(252).std().to_frame(name='1yr_vol') * np.sqrt(252)
vol_3m = px['adjclose'].pct_change().rolling(63).std().to_frame(name='3mth_vol') * np.sqrt(252)
vol_6m = px['adjclose'].pct_change().rolling(126).std().to_frame(name="6mth_vol") * np.sqrt(252)
mom_1yr = px['adjclose'].pct_change(252).to_frame(name='1yr_mom')
mom_3m = px['adjclose'].pct_change(63).to_frame(name='3m_mom')
mom_6m = px['adjclose'].pct_change(126).to_frame(name='6m_mom')
px_based = ret_1yr.join(ret_3m, how='outer')
px_based = px_based.join(ret_6m, how='outer')
px_based = px_based.join(vol_1yr, how='outer')
px_based = px_based.join(vol_6m)
px_based = px_based.join(vol_3m)
px_based = px_based.join(mom_1yr)
px_based = px_based.join(mom_6m)
px_based = px_based.join(mom_3m)
px_based['Instrument'] = s
df = pd.concat([df, px_based])
print('{}: no {} of {} complete'.format(s, len(df['Instrument'].dropna().unique()), len(tics)))
engine = create_engine('postgresql+psycopg2://postgres:********@********:5432/697_temp')
px_dat = px_dat.reset_index()
# create the table schema, then bulk-load the rows via COPY
px_dat.head(0).to_sql("px_dat", con = engine, if_exists = 'replace', index = False)
conn = engine.raw_connection()
cur = conn.cursor()
output = io.StringIO()
px_dat.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, "px_dat")
conn.commit()
print("done")
return df
def valuation():
"""Creates valuation features including EV and Px pased multiples"""
df = combine_inc_bal()
df['date'] = pd.to_datetime(df['date']).dt.date
rics = list(df['Instrument'])
tics = [i.split('.')[0] for i in rics]
tics = ['BRK-B' if i == 'BRKb' else i for i in tics]
tics = ['BF-B' if i == 'BFb' else i for i in tics]
df['Instrument'] = tics
df = df.dropna(subset=['date'])
acts = df.filter(['Instrument', 'date', 'eps_excl_basic', 'bvps', 'cfo_ps', 'ebit',
'ebitda', 'rev', 'debt', 'cash'])
conn = connect(dbname = '697_temp', user = 'postgres', host = '********', password = '********')
cur = conn.cursor()
query = "SELECT * FROM px_dat"
px = pd.read_sql_query(query,conn)
px = px.rename(columns = {"index":"date"})
px['date'] = pd.to_datetime(px['date']).dt.date
valuation_df = px.merge(acts, on=['date', 'Instrument'], how='outer').fillna(method='ffill')
valuation_df = valuation_df.drop_duplicates()
valuation_df['p_e'] = valuation_df['close'] / valuation_df["eps_excl_basic"]
valuation_df['p_b'] = valuation_df['close'] / valuation_df["bvps"]
valuation_df['p_cf'] = valuation_df['close'] / valuation_df["cfo_ps"]
conn = connect(dbname = '697_temp', user = 'postgres', host = '********', password = '********')
cur = conn.cursor()
query = "SELECT * FROM mkt_cap"
mkt_cap = pd.read_sql_query(query,conn)
mkt_cap = mkt_cap.set_index('Dates')
mkt_cap.columns = [re.findall(r"(?<=<)(.*?)(?=>)", i)[0] for i in mkt_cap.columns]
mkt_cap.columns = [i.split('.')[0] for i in mkt_cap.columns]
mkt_cap = mkt_cap * 1000000
mkt_cap = mkt_cap.reset_index()
mkt_cap = mkt_cap.rename(columns={'Dates': 'date'})
mkt_cap['date'] = pd.to_datetime(mkt_cap['date']).dt.date
mkt_values = pd.melt(mkt_cap, id_vars='date', var_name='Instrument', value_name='mkt_cap')
valuation_df = valuation_df.merge(mkt_values, on=['date', 'Instrument'], how='outer').dropna()
valuation_df['ev'] = valuation_df['mkt_cap'] + valuation_df['debt'] - valuation_df['cash']
valuation_df['ev_ebit'] = valuation_df['ev'] / valuation_df['ebit']
valuation_df['ev_ebitda'] = valuation_df['ev'] / valuation_df['ebitda']
valuation_df['ev_sales'] = valuation_df['ev'] / valuation_df['rev']
valuation_df = valuation_df.filter(['date', 'Instrument', 'p_e', 'p_b', 'p_cf', 'ev',
'ev_ebit', 'ev_ebitda', 'ev_sales'])
return valuation_df
def technicals():
"""calculates 200 day and 50 day exponentially weighted moving averages """
conn = connect(dbname = '697_temp', user = 'postgres', host = 'databasesec.cvhiyxfodl3e.us-east-2.rds.amazonaws.com', password = 'poRter!5067')
cur = conn.cursor()
query = "SELECT * FROM px_dat"
px = pd.read_sql_query(query,conn)
px = px.rename(columns = {"index":"date"})
px['date'] = pd.to_datetime(px['date']).dt.date
df = pd.DataFrame()
for s in px['Instrument'].unique():
stock = px[px['Instrument'] == s].copy()  # copy to avoid SettingWithCopyWarning
stock['200_ma'] = stock['close'].ewm(span=200).mean()
stock['50_ma'] = stock['close'].ewm(span=50).mean()
df = pd.concat([df, stock])
return df
def merge_data(act, trd, val, tec, labs):
"""Combines all datasets"""
mkt_vol = trd.reset_index()
mkt_vol = mkt_vol.rename(columns={'index': 'date'})
rics = list(mkt_vol['Instrument'])
tics = [i.split('.')[0] for i in rics]
tics = ['BRK-B' if i == 'BRKb' else i for i in tics]
tics = ['BF-B' if i == 'BFb' else i for i in tics]
mkt_vol['Instrument'] = tics
mkt_vol['date'] = pd.to_datetime(mkt_vol['date']).dt.date
mom_labs = labs
mom_labs = mom_labs.reset_index()
mom_labs = mom_labs.rename(columns={'index': 'date'})
mom_labs['date'] = pd.to_datetime(mom_labs['date']).dt.date
labs = mom_labs.filter(['Instrument', 'date', '1yr_ret', '3m_ret', '6m_ret'])
mom = mom_labs.drop(['1yr_ret', '3m_ret', '6m_ret'], axis=1)
val['date'] = pd.to_datetime(val['date']).dt.date
tech = tec
tech['date'] = pd.to_datetime(tech['date']).dt.date
rics = list(act['Instrument'])
tics = [i.split('.')[0] for i in rics]
tics = ['BRK-B' if i == 'BRKb' else i for i in tics]
tics = ['BF-B' if i == 'BFb' else i for i in tics]
act['Instrument'] = tics
act['date'] = pd.to_datetime(act['date']).dt.date
data = mkt_vol.merge(mom, on=['date', 'Instrument'], how='inner')
data = data.merge(val, on=['date', 'Instrument'], how='inner')
data = data.merge(tech, on=['date', 'Instrument'], how='inner')
data['date'] = pd.to_datetime(data['date']).dt.date
data = data.merge(act, on=['Instrument', 'date'], how='outer')
data = data.sort_values(by=['Instrument', 'date'])
data = data.drop_duplicates()
data = data.fillna(method='ffill').dropna()
labs = labs.sort_values(by=['Instrument', 'date'])
data = data.merge(labs, on=['Instrument', 'date'], how='outer')
engine = create_engine('postgresql+psycopg2://postgres:********@********:5432/697_temp')
# create the table schema, then bulk-load the rows via COPY
data.head(0).to_sql("raw_data", con = engine, if_exists = 'replace', index = False)
conn = engine.raw_connection()
cur = conn.cursor()
output = io.StringIO()
data.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, "raw_data")
conn.commit()
print("done")
return data
def create_blocks():
"""converts dataset to monthly data, and generates data blocks"""
conn = connect(dbname = '697_temp', user = 'postgres', host = 'databasesec.cvhiyxfodl3e.us-east-2.rds.amazonaws.com', password = 'poRter!5067')
cur = conn.cursor()
query = "SELECT * FROM raw_data"
data = pd.read_sql_query(query,conn)
data["date"] = pd.to_datetime(data["date"])
reduced = data.dropna(subset=['12m_volume'], axis=0)
monthly = reduced.loc[reduced.date.isin(reduced.date + pd.offsets.BMonthEnd(1))].copy()
monthly["q"] = monthly['date'].dt.quarter
monthly['yr'] = monthly["date"].dt.year
monthly['timeframe'] = monthly["q"].astype(str)+"_"+monthly["yr"].astype(str)
for t in monthly["timeframe"].unique():
tf = monthly[monthly['timeframe']==t]
tf = tf.drop(["q", "yr", "timeframe"], axis = 1)
tf.to_csv("assets/models/jeff_multi_factor/block_data{}.csv".format(t))
monthly = monthly.drop(["q", "yr", "timeframe"], axis = 1)
engine = create_engine('postgresql+psycopg2://postgres:********@********:5432/697_temp')
# create the table schema, then bulk-load the rows via COPY
monthly.head(0).to_sql("monthly", con = engine, if_exists = 'replace', index = False)
conn = engine.raw_connection()
cur = conn.cursor()
output = io.StringIO()
monthly.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, "monthly")
conn.commit()
print("done")
return monthly
def run_feature_engine():
"""This function will run all the functions used to compile the data needed for the model from the database, and will generate blocks this function will take about 1 hour to run"""
act = combine_inc_bal()
print('act_done')
trd = merge_vol_mkt()
print('trd_done')
labs = creat_labs_vol()
print('labs_done')
val = valuation()
print('val_done')
tec = technicals()
print('tec_done')
raw = merge_data(act, trd, val, tec, labs)
print("DB updated")
monthly = create_blocks()
print("Done")
return monthly
#to recreate features uncomment below, run, and wait...or just move on to the next cell to import pre-engineered features
#run_feature_engine()
###helper
def batch(iterable, n=1):
"""Helper function that assist in managing batch sizes"""
l = len(iterable)
for ndx in range(0, l, n):
yield iterable[ndx:min(ndx + n, l)]
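A self-contained demonstration of the batching pattern above (the helper is restated here so the example runs on its own):

```python
# Restatement of the batch() helper for a standalone demonstration.
def batch(iterable, n=1):
    """Yield successive chunks of at most n items from iterable."""
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

chunks = list(batch(list(range(10)), n=4))
# chunks -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```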
def import_monthly():
"""Imports our dataset of monthly features"""
conn = connect(dbname = '697_temp', user = 'postgres', host = '********', password = '********')
cur = conn.cursor()
query = "SELECT * FROM monthly"
data = pd.read_sql_query(query,conn)
data = data.sort_values(['Instrument', 'date'])
data['date'] = pd.to_datetime(data['date'])
return data
data = import_monthly()
The following data evaluation process has three purposes: to profile the distribution of each feature, to examine the correlations between features and forward returns, and to assess the dispersion of those correlations over time.
The histogram charts below allow us to approximate the distribution of the various features fed into the model. Due to the asymmetric nature of many of the features, and the difficulty of identifying observations in the tails via standard distribution plots, the data below was transformed to a logarithmic scale. Even after the transformation, the data is fraught with outliers.
reduced = data.dropna(subset=['12m_volume'], axis = 0)
dist_dat = reduced.drop(['3m_ret', '6m_ret', 'Instrument', 'date'], axis = 1)
dist_dat = dist_dat.replace([np.inf, -np.inf], np.nan)
dist_dat = dist_dat.dropna()
scaled = StandardScaler().fit_transform(dist_dat)
dist_dat_2 = pd.DataFrame(scaled, columns = dist_dat.columns)
xaxes = ['x_{}_log'.format(i) for i in dist_dat.columns]
yaxes = ['freq' for i in dist_dat.columns]
titles = dist_dat.columns
palette = itertools.cycle(sns.color_palette())
f,a = plt.subplots(14,5, figsize = (20,30))
plt.subplots_adjust(top=0.96)
a = a.ravel()
for idx, ax in enumerate(a):
c = next(palette)
ax.hist(dist_dat_2[dist_dat_2.columns[idx]],histtype = u'stepfilled', density=True, color = c, log = True)
ax.set_title(titles[idx])
ax.set_xlabel(xaxes[idx])
ax.set_ylabel(yaxes[idx])
plt.suptitle('Data Distribution Profile', y = 0.98)
plt.show()
Below is a basic correlation matrix. The matrix makes clear that correlations are, for the most part, weak (close to zero), with pockets of stronger positive correlation among certain features (the accounting features in particular).
###interactive correlation matrix
reduced = data.dropna(subset=['12m_volume'], axis = 0)
labels = reduced.filter(['date','Instrument','1yr_ret', '3m_ret', '6m_ret'])
features = reduced.drop(['1yr_ret', '3m_ret', '6m_ret'], axis = 1)
corr_dat = reduced.drop(['3m_ret', '6m_ret'], axis = 1).dropna()
corr_dat = corr_dat.groupby('date').corr()[['1yr_ret']].reset_index()
corr_dat = corr_dat[corr_dat['level_1']!='1yr_ret']
corr = reduced.drop(['3m_ret', '6m_ret'], axis = 1).dropna()
corr = corr.corr()
fig = go.Figure(data=go.Heatmap(
z = corr,
x = corr.index,
y = corr.columns,
xgap =2,
ygap =2,
colorscale="RdBu"
))
fig.update_layout(
title = "Correlation Matrix",
title_x = 0.5,
paper_bgcolor='white',
plot_bgcolor='white',
height = 900,
width = 900,
margin=dict(
l=30,
r=30,
b=30,
t=40,
),
)
fig.update_xaxes(tickfont = dict(size = 8))
fig.update_yaxes(tickfont = dict(size = 8))
Image(pio.to_image(fig, format='png'))
We are interested not only in the correlation between the features and one-year forward returns, but also in the distribution of those correlations, which allows us to better understand the impact any outliers may have on the data. The box plots below illustrate this distribution; the correlations are computed on a date-by-date basis for the entire cross-section of stocks. The plot shows that correlations mostly revolve around zero, but can reach as high as 0.7 (one-year volatility) and as low as -0.47 (three-month momentum). The key takeaway is that not only are correlations between features and one-year forward returns low, as the correlation matrix demonstrated, but the dispersion of each feature's correlation with one-year returns is significant in many cases.
x_dat = corr_dat['level_1'].unique()
y_dat = [list(corr_dat[corr_dat['level_1']==i]['1yr_ret']) for i in x_dat]
colors = ['rgb(136, 204, 238)', 'rgb(204, 102, 119)', 'rgb(221, 204, 119)', 'rgb(17, 119, 51)', 'rgb(51, 34, 136)',
'rgb(170, 68, 153)', 'rgb(68, 170, 153)', 'rgb(153, 153, 51)', 'rgb(136, 34, 85)', 'rgb(102, 17, 0)',
'rgb(136, 136, 136)']
xaxes = ['{}'.format(i) for i in x_dat]
yaxes = ['dist' for i in dist_dat.columns]
titles = dist_dat.columns
palette = itertools.cycle(sns.color_palette())
f,a = plt.subplots(23,3, figsize = (20,50))
plt.subplots_adjust(top=0.97)
a = a.ravel()
for idx, ax in enumerate(a):
c = next(palette)
ax.boxplot(y_dat[idx], patch_artist = True, showmeans = True, boxprops = dict(facecolor = c, alpha = 0.3), medianprops = dict(color = 'black'), labels = [''])
ax.set_title(titles[idx])
ax.set_xlabel(None)
ax.set_ylabel(yaxes[idx])
plt.suptitle('Distribution of Correlations', y = 0.98)
plt.show()
Effectively, the data forces one of two conclusions: either 1) these features genuinely contain little information about future returns, or 2) outliers in the data are obscuring relationships that do exist.
To resolve this, we turn to domain knowledge: it is difficult to accept that a collection of the most commonly used stock-picking features would have nothing to do with stock returns. As such, the second conclusion appears more likely.
To eliminate outliers, we chose the isolation forest anomaly detection algorithm. The general idea behind isolation forests is that, when dealing with large amounts of data, it is easier to identify anomalies than to characterize normal points. The algorithm works like a decision tree that repeatedly partitions the data at a random split point between a feature's minimum and maximum. Because outliers lie far from the bulk of the data, they tend to be isolated in fewer splits, ending up closer to the root of the tree.
As previously discussed, the model's objective is to classify stocks into the top 50% and bottom 50% of gainers. We chose the median because, in addition to being a widely used measure of central tendency, it guarantees a balanced training set regardless of the data's distribution. It also ensures that the model evaluates stocks on a relative basis, which we believe to be advantageous despite the potential for misclassifications near the median.
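A quick sketch of why the median guarantees balance, even for a heavily skewed return distribution (simulated data, for illustration only):

```python
import numpy as np
import pandas as pd

# Labeling against the cross-sectional median yields a balanced target
# by construction, regardless of how skewed the return distribution is
rng = np.random.default_rng(0)
rets = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=1000) - 1)  # right-skewed returns
label = (rets >= rets.median()).astype(int)
print(label.value_counts())  # 500 / 500 split by construction
```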
###Outlier detection and removal using isolation forests
training_block = pd.read_csv("assets/models/jeff_multi_factor/block_data1_2019.csv", index_col = 0)
training_block = training_block.drop(['3m_ret', '6m_ret', 'Instrument', 'date'], axis = 1)
training_block['label'] = [1 if i >= training_block['1yr_ret'].median() else 0 for i in training_block['1yr_ret']]
training_block = training_block.drop(['1yr_ret'], axis = 1)
training_block = training_block.replace([np.inf, -np.inf], np.nan)
training_block = training_block.dropna()
data = training_block.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
iso = IsolationForest(contamination=0.2)
yhat = iso.fit_predict(X)
mask = yhat != -1
X, y = X[mask, :], y[mask]
print(X.shape, y.shape)
(1467, 69) (1467,) (1173, 69) (1173,)
Based on the isolation forest model used to detect outliers, we discarded roughly 20% of the original observations as extreme (1,467 down to 1,173, consistent with the contamination parameter of 0.2). The remaining data must now be scaled and split into training and test sets. Given the balanced nature of the data and the relatively small number of observations, we split it 80% train / 20% test.
###Training_set Normalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scalar = StandardScaler().fit(X_train)
scaled_X_train = scalar.transform(X_train)
scaled_X_test = scalar.transform(X_test)
We used the Keras package to construct our model: a sequential model with three hidden layers, all using the ReLU activation function, and an output layer with a single node using the sigmoid activation function. We use the binary cross-entropy loss function and the Adam optimizer. We also include dropout after every hidden layer to help minimize potential overfitting, and l2 regularization to dampen any effects of multicollinearity. Lastly, we use the Keras RandomSearch tuner to identify the ideal number of nodes in each hidden layer and the most effective learning rate for the optimizer. The tuner's objective is set to minimize the loss on the validation (test) set.
def build_mod(hp):
    tf.random.set_seed(42)
    hp_units1 = hp.Int('units1', min_value=8, max_value=16, step=1)
    hp_units2 = hp.Int('units2', min_value=6, max_value=8, step=1)
    hp_units3 = hp.Int('units3', min_value=2, max_value=6, step=1)
    model = Sequential()
    model.add(Dense(units = hp_units1, activation='relu', input_dim = 69, activity_regularizer = l2(0.002)))
    model.add(Dropout(0.5))
    model.add(Dense(units = hp_units2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(units = hp_units3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer = Adam(hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])), metrics=['binary_accuracy', 'Precision', 'Recall'])
    return model
tuner = kt.RandomSearch(build_mod, objective = kt.Objective("val_loss",direction="min"), max_trials = 25, seed = 42, overwrite = True)
tuner.search(scaled_X_train, y_train, epochs = 150, batch_size=32, validation_data = (scaled_X_test, y_test), verbose = 0)
INFO:tensorflow:Oracle triggered exit
tuner.results_summary(1)
Results summary
Results in .\untitled_project
Showing 1 best trials
Objective(name='val_loss', direction='min')
Trial summary
Hyperparameters:
units1: 15
units2: 8
units3: 4
learning_rate: 0.01
Score: 0.6365458965301514
### store model hyperparams for later use
model = tuner.get_best_models()[0]
config = model.get_config()
hl_1 = config['layers'][1]['config']['units']
hl_2 = config['layers'][3]['config']['units']
hl_3 = config['layers'][5]['config']['units']
lr = eval(model.optimizer.lr) # eval here is keras.backend.eval, which extracts the learning rate as a float
Once the tuner has identified the best model, we retrain the model using the identified parameters in order to evaluate the model's performance over all iterations, based on the following metrics:
###Retrain the model and record History for evaluation
clear_session()
tf.random.set_seed(42)
#es = EarlyStopping(monitor='val_loss', min_delta=0.001,patience=20, restore_best_weights=True)
model = Sequential()
model.add(Dense(hl_1, activation='relu', input_dim = 69, activity_regularizer = l2(0.002)))
model.add(Dropout(0.5))
model.add(Dense(hl_2, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(hl_3, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate = lr), metrics=['binary_accuracy', 'Precision', 'Recall'])
history = model.fit(scaled_X_train, y_train, epochs=150, batch_size=32, validation_data = (scaled_X_test, y_test), verbose = 0)
Both the training and testing loss decline steadily over the first 125 epochs, at which point the validation loss begins to level off and the model can no longer improve. By epoch 150 the model appears to be slightly overfitting (0.61 train vs. 0.68 validation loss), but this is not extreme and does not yet impact the other metrics. Binary accuracy climbs steadily for both the training and validation sets throughout training, though the climb tapers off around 0.62. While this may feel like a small number, context is important: if an algorithm existed that could pick the top 50% of stocks with 90% accuracy, everyone would adopt it, and as adoption spread its predictive power would decline. As such, the 0.62 to 0.65 accuracy range appears reasonable. The precision numbers indicate that our model does a reasonable job avoiding bad stocks. The weakest number is the recall, which shows the model has difficulty isolating good stocks, i.e., it produces more false negatives.
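As a reminder of how these two metrics map onto the stock-picking framing, a small illustrative example with hand-made labels (not model output):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 1 = top-half stock, 0 = bottom-half stock
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Precision: of the stocks the model picked, how many were truly top-half (avoiding bad stocks)
# Recall: of the truly top-half stocks, how many the model actually picked
print(precision_score(y_true, y_pred))  # 2 / 3
print(recall_score(y_true, y_pred))     # 2 / 4
```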
color_list = ['#6929c4', '#1192e8', '#005d5d', '#9f1853', '#fa4d56', '#570408', '#198038', '#002d9c', '#ee538b', '#b28600', '#009d9a', '#012749']
fig = make_subplots(rows=2, cols=2,shared_yaxes = False, vertical_spacing = 0.1,
subplot_titles = ('loss', 'binary_accuracy', 'precision', 'recall' ))
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['loss'])+1)], y = history.history['loss'], name = 'train', line = {'color':color_list[0]}), row = 1, col = 1)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['val_loss'])+1)], y = history.history['val_loss'], name = 'val', line = {'color':color_list[1]}), row = 1, col = 1)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['binary_accuracy'])+1)], y = history.history['binary_accuracy'], name = 'train', line = {'color':color_list[2]}), row = 1, col = 2)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['binary_accuracy'])+1)], y = history.history['val_binary_accuracy'], name = 'val', line = {'color':color_list[3]}), row = 1, col = 2)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['precision'])+1)], y = history.history['precision'], name = 'train', line = {'color':color_list[4]}), row = 2, col = 1)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['val_precision'])+1)], y = history.history['val_precision'], name = 'val', line = {'color':color_list[5]}), row = 2, col = 1)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['recall'])+1)], y = history.history['recall'], name = 'train', line = {'color':color_list[6]}), row = 2, col = 2)
fig.add_trace(go.Scatter(x = [i for i in range(1, len(history.history['recall'])+1)], y = history.history['val_recall'], name = 'val', line = {'color':color_list[7]}), row = 2, col = 2)
#fig.update_yaxes(range=[0.5,1], row = 1, col = 1)
fig.update_layout(
paper_bgcolor='white',
plot_bgcolor= 'white',
margin=dict(l=48, r=48, t=100, b=10),
height = 700,
width = 900,
showlegend=True,
title = dict(text = 'ANN Multifactor Model Results', x = 0.5, y = 0.97))
fig.update_layout(hovermode="x unified")
Image(pio.to_image(fig, format='png'))
### the model history is stored in a csv file for later use.
ml_stats = pd.DataFrame().from_dict(history.history)
ml_stats.to_csv('assets/models/jeff_multi_factor/ml_stats/mf_mlp.csv', index = False)
clear_session()
We are now ready to put our model to use in a simulated environment. To do this, we first build a pipeline that takes in raw data, creates features that are free of outliers and properly scaled, and generates a label based on the one-year forward return for each stock at each point in time. This data is then used to train our model, which in turn classifies all stocks based on the next day's data. Simply put, we use three months of data to train our model and make one prediction based on that data. The data is then discarded, and a fresh set of data is used to train the next iteration.
Regime change, in the context of the stock market, is when "something changes" and the market no longer behaves the way it previously did. It is precisely the issue of regime change that makes stock selection particularly difficult to do successfully over extended periods of time: regime change is always obvious in hindsight and almost never visible in advance. Any model that searches for patterns in data can have a particularly difficult time classifying properly once stocks begin to behave differently than they had previously. To deal with this, our model is constructed to discard old data once it has been used and never re-introduce it. This is also the reason the model is structured to "trade" quarterly.
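The discard-and-retrain scheme can be sketched abstractly as a walk-forward loop; the callables here are hypothetical stand-ins, not the actual pipeline functions:

```python
trade_dates = ['2019-10-01', '2020-01-02', '2020-04-01']

def walk_forward(dates, load_block, fit_model, predict):
    """Quarterly walk-forward: each trade date gets a model trained only on
    its own fresh data block; old data and old weights are never reused."""
    preds = {}
    for d in dates:
        X, y = load_block(d)      # fresh training block for this quarter only
        model = fit_model(X, y)   # model rebuilt from scratch each quarter
        preds[d] = predict(model, d)
        # nothing carries over to the next iteration
    return preds
```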
###function imports data to be used for prediction after the model is trained
def import_trade_data(trade_date):
    """
    :param trade_date: a string, must be in YYYY-MM-DD format
    calls raw data for all stocks on a specific date from the SQL database
    """
    conn = connect(dbname = '697_temp', user = 'postgres', host = 'databasesec.cvhiyxfodl3e.us-east-2.rds.amazonaws.com', password = 'poRter!5067')
    cur = conn.cursor()
    query = "SELECT * FROM raw_data WHERE date = '{}'".format(trade_date)
    trade_date = pd.to_datetime(trade_date)
    dat = pd.read_sql_query(query, conn)
    return dat
def import_px_data(trade_date):
    """
    :param trade_date: a string, must be in YYYY-MM-DD format
    calls price data for all stocks on a specific date from the SQL database
    """
    conn = connect(dbname = '697_temp', user = 'postgres', host = 'databasesec.cvhiyxfodl3e.us-east-2.rds.amazonaws.com', password = 'poRter!5067')
    cur = conn.cursor()
    query = "SELECT * FROM px_dat WHERE index = '{}'".format(trade_date)
    trade_date = pd.to_datetime(trade_date)
    dat = pd.read_sql_query(query, conn)
    return dat
###pipeline
def run_model(trade_date):
    """
    :param trade_date: a string, must be in YYYY-MM-DD format
    runs our model pipeline from start to finish
    returns a df of classifications for each stock
    """
    seed(42)
    tf.random.set_seed(42)
    q = pd.to_datetime(trade_date).quarter
    yr = pd.to_datetime(trade_date).year - 1
    training_block = pd.read_csv("assets/models/jeff_multi_factor/block_data{}_{}.csv".format(q, yr), index_col = 0)
    training_block = training_block.drop(['3m_ret', '6m_ret', 'Instrument', 'date'], axis = 1)
    training_block['label'] = [1 if i >= training_block['1yr_ret'].median() else 0 for i in training_block['1yr_ret']]
    training_block = training_block.drop(['1yr_ret'], axis = 1)
    training_block = training_block.replace([np.inf, -np.inf], np.nan)
    training_block = training_block.dropna()
    data = training_block.values
    X, y = data[:, :-1], data[:, -1]
    print('No of observations pre outlier removal: {}'.format(X.shape[0]))
    pre = X.shape[0]
    iso = IsolationForest(contamination=0.2)
    yhat = iso.fit_predict(X)
    mask = yhat != -1
    X, y = X[mask, :], y[mask]
    print('No of observations after outlier removal: {}'.format(X.shape[0]))
    percent_kept = X.shape[0]/pre
    print('{:.2%} of the data was kept'.format(percent_kept))
    scalar = StandardScaler().fit(X)
    scaled_X = scalar.transform(X)
    model = Sequential()
    model.add(Dense(hl_1, activation='relu', input_dim = 69, activity_regularizer = l2(0.002)))
    model.add(Dropout(0.5))
    model.add(Dense(hl_2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(hl_3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate = lr), metrics=['binary_accuracy', 'Precision', 'Recall'])
    model.fit(scaled_X, y, epochs=148, batch_size=32, verbose = 0)
    score = model.evaluate(scaled_X, y, verbose = 0)
    print('Model loss:', score[0])
    print('Model accuracy:', score[1])
    print('Model Precision:', score[2])
    print('Model Recall:', score[3])
    input_raw = import_trade_data(trade_date).dropna(subset = ['12m_volume'])
    results = list(input_raw['1yr_ret'])
    stocks = list(input_raw['Instrument'])
    input_raw = input_raw.replace([np.inf, -np.inf], np.nan)
    input_raw = input_raw.drop(['3m_ret', '6m_ret', 'Instrument', 'date', '1yr_ret'], axis = 1)
    mkt_data = input_raw.values
    scaled_mkt = scalar.transform(mkt_data)
    y_prob = model.predict_proba(scaled_mkt)
    y_class = model.predict_classes(scaled_mkt)
    output_df = input_raw.copy()
    output_df['Instrument'] = stocks
    output_df['class_pred'] = y_class
    output_df['probab'] = y_prob
    try:
        output_df['True'] = [1 if i >= np.median(results) else 0 for i in results]
    except Exception:
        output_df['True'] = ['unknown' for i in range(0, len(results))]
    clear_session()
    return output_df
def construct_portfolio(output_df, trade_date):
    """
    :param output_df: a df produced by the model pipeline function
    :param trade_date: a string, must be in YYYY-MM-DD format
    returns a df of stocks used to build the portfolio
    """
    stocks = output_df.filter(['3m_avg_mkt_cap', 'Instrument', 'class_pred', 'probab'])
    stocks = stocks[stocks['class_pred']==1].sort_values('probab', ascending = False).reset_index().drop('index', axis = 1)[0:50]
    stocks = stocks.sort_values('3m_avg_mkt_cap', ascending = False)
    stocks['weights'] = stocks["3m_avg_mkt_cap"]/stocks["3m_avg_mkt_cap"].sum()
    stocks = stocks[stocks['weights']>=0.02]
    stocks['weights'] = stocks['weights']/stocks['weights'].sum()
    px_dat = import_px_data(trade_date).drop('index', axis = 1)
    stocks = stocks.merge(px_dat, on = 'Instrument', how = 'inner')
    return stocks
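The weighting logic inside `construct_portfolio` (cap-weight the picks, drop anything under a 2% weight, then renormalize) can be seen in isolation with toy market caps; the tickers and cap values below are invented:

```python
import pandas as pd

# Toy market caps for four hypothetical picks
caps = pd.Series({'AAA': 500, 'BBB': 300, 'CCC': 150, 'DDD': 3})

w = caps / caps.sum()   # cap-weight
w = w[w >= 0.02]        # drop anything under a 2% weight
w = w / w.sum()         # renormalize so weights sum to 1 again
print(w.round(3))       # DDD (~0.3%) is dropped; the rest are rescaled
```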
def backtest_prep():
    """
    loops through trade dates, runs the model pipeline, and generates a list of trades to place based on model results
    """
    dates = ['2019-10-01', '2020-01-02', '2020-04-01', '2020-07-01', '2020-10-01', '2021-01-04', '2021-04-01']
    transaction_df = pd.DataFrame()
    for d in dates:
        stocks = run_model(d)
        port = construct_portfolio(stocks, d)
        port['Date'] = d
        to_df = port.filter(['Date', 'Instrument', 'weights', 'close'])
        transaction_df = pd.concat([transaction_df, to_df])
    transaction_df.to_csv('assets/models/jeff_multi_factor/bt_transactions.csv', index = False)
    return
backtest_prep()
No of observations pre outlier removal: 489
No of observations after outlier removal: 391
79.96% of the data was kept
Model loss: 0.45085418224334717
Model accuracy: 0.7851662635803223
Model Precision: 0.7137546539306641
Model Recall: 0.9648241400718689
No of observations pre outlier removal: 1467
No of observations after outlier removal: 1173
79.96% of the data was kept
Model loss: 0.5391020774841309
Model accuracy: 0.7706735134124756
Model Precision: 0.8387096524238586
Model Recall: 0.6678082346916199
No of observations pre outlier removal: 1467
No of observations after outlier removal: 1173
79.96% of the data was kept
Model loss: 0.4786199629306793
Model accuracy: 0.7962489128112793
Model Precision: 0.8758314847946167
Model Recall: 0.6833909749984741
No of observations pre outlier removal: 1469
No of observations after outlier removal: 1175
79.99% of the data was kept
Model loss: 0.49982428550720215
Model accuracy: 0.7651063799858093
Model Precision: 0.9522387981414795
Model Recall: 0.5509499311447144
No of observations pre outlier removal: 1477
No of observations after outlier removal: 1181
79.96% of the data was kept
Model loss: 0.44042694568634033
Model accuracy: 0.8264182806015015
Model Precision: 0.8534798622131348
Model Recall: 0.7884940505027771
No of observations pre outlier removal: 1476
No of observations after outlier removal: 1181
80.01% of the data was kept
Model loss: 0.4339868724346161
Model accuracy: 0.8323454856872559
Model Precision: 0.8336314558982849
Model Recall: 0.8161121010780334
No of observations pre outlier removal: 1477
No of observations after outlier removal: 1181
79.96% of the data was kept
Model loss: 0.3425152599811554
Model accuracy: 0.8755292296409607
Model Precision: 0.9163498282432556
Model Recall: 0.8239316344261169
#test.reset_account()
#establish a portfolio using the port_2 class developed for this project
test = prt.portfolio(start_date='2019-10-01', value=1000000, end_date='2021-07-02')
The trading engine is one of the more difficult parts of this project. It requires that we simulate both a market and a financial services firm at the same time. This is done by taking the predetermined transactions from the list generated above, executing them using historical prices, and updating performance between trade dates. The process is also designed to ensure that sufficient funds exist for transactions, that additional lots are properly accounted for, that performance is continuously calculated, that positions are known at all times, and that no short selling is allowed. Though there are similarities across models, each model has its own trading engine based on the guidelines established for that model.
transactions = pd.read_csv("assets/models/jeff_multi_factor/bt_transactions.csv")
transactions['Date'] = pd.to_datetime(transactions['Date'])
trade_dates = transactions['Date'].unique()
nyse = mcal.get_calendar('NYSE')
dates = nyse.schedule(start_date='2019-10-01', end_date='2021-07-02')['market_close'].index
date_prep = pd.DataFrame({'Dates':dates})
date_prep['Quarter'] = date_prep['Dates'].dt.quarter
date_prep['Year'] = date_prep['Dates'].dt.year
dates = list(date_prep.groupby(['Year', 'Quarter']).min()['Dates'])
dates = [i.strftime('%Y-%m-%d') for i in dates]
for d in dates[0:-1]:
    port = test.open_positions_df
    px_exist = [test.get_price(d, i) for i in list(port['Ticker'])]
    port['Last'] = px_exist
    port["Current Value"] = port['Last'] * port['Quantity']
    new_port = transactions[transactions['Date'] == d]
    px_new = [test.get_price(d, i) for i in list(new_port['Instrument'])]
    new_port['close'] = px_new
    new_port['shares'] = ((new_port['weights'] * (port['Current Value'].sum() + test.current_cash))//new_port['close']).astype(int)
    if len(port) != 0:
        stocks_to_sell = list(port['Ticker'])
        shares_to_sell = list(port['Quantity'])
        so = dict(zip(stocks_to_sell, shares_to_sell))
        stocks_to_buy = list(new_port['Instrument'])
        shares_to_buy = list(new_port['shares'])
        po = dict(zip(stocks_to_buy, shares_to_buy))
        test.execute_trades(so, date = d, t_type = 'Sell')
        test.execute_trades(po, date = d, t_type = 'buy')
    else:
        stocks_to_buy = list(new_port['Instrument'])
        shares_to_buy = list(new_port['shares'])
        po = dict(zip(stocks_to_buy, shares_to_buy))
        test.execute_trades(po, date = d, t_type = 'buy')
test.update_portfolio(date = dates[-1], cash_add = None)
Order to BUY 1577 shares of WMT validated and approved Order to BUY 577 shares of MA validated and approved Order to BUY 1201 shares of CVX validated and approved Order to BUY 301 shares of BA validated and approved Order to BUY 253 shares of COST validated and approved Order to BUY 398 shares of DHR validated and approved Order to BUY 833 shares of ABBV validated and approved Order to BUY 122 shares of CHTR validated and approved Order to BUY 23 shares of BKNG validated and approved Order to BUY 992 shares of MS validated and approved Order to BUY 91 shares of BLK validated and approved Order to BUY 503 shares of CL validated and approved Order to BUY 112 shares of NOW validated and approved Trades Executed Order to SELL 1577 shares of WMT validated and approved Order to SELL 577 shares of MA validated and approved Order to SELL 1201 shares of CVX validated and approved Order to SELL 301 shares of BA validated and approved Order to SELL 253 shares of COST validated and approved Order to SELL 398 shares of DHR validated and approved Order to SELL 833 shares of ABBV validated and approved Order to SELL 122 shares of CHTR validated and approved Order to SELL 23 shares of BKNG validated and approved Order to SELL 992 shares of MS validated and approved Order to SELL 91 shares of BLK validated and approved Order to SELL 503 shares of CL validated and approved Order to SELL 112 shares of NOW validated and approved Trades Executed Order to BUY 1644 shares of MSFT validated and approved Order to BUY 106 shares of AMZN validated and approved Order to BUY 584 shares of JNJ validated and approved Order to BUY 665 shares of WMT validated and approved Order to BUY 582 shares of PG validated and approved Order to BUY 1055 shares of XOM validated and approved Order to BUY 1811 shares of T validated and approved Order to BUY 394 shares of DIS validated and approved Order to BUY 993 shares of VZ validated and approved Order to BUY 996 shares of WFC validated and approved Order to 
BUY 1051 shares of CMCSA validated and approved Order to BUY 218 shares of IBM validated and approved Trades Executed Order to SELL 1644 shares of MSFT validated and approved Order to SELL 106 shares of AMZN validated and approved Order to SELL 584 shares of JNJ validated and approved Order to SELL 665 shares of WMT validated and approved Order to SELL 582 shares of PG validated and approved Order to SELL 1055 shares of XOM validated and approved Order to SELL 1811 shares of T validated and approved Order to SELL 394 shares of DIS validated and approved Order to SELL 993 shares of VZ validated and approved Order to SELL 996 shares of WFC validated and approved Order to SELL 1051 shares of CMCSA validated and approved Order to SELL 218 shares of IBM validated and approved Trades Executed Order to BUY 3992 shares of AAPL validated and approved Order to BUY 94 shares of AMZN validated and approved Order to BUY 68 shares of GOOGL validated and approved Order to BUY 551 shares of WMT validated and approved Order to BUY 207 shares of UNH validated and approved Order to BUY 1890 shares of T validated and approved Order to BUY 258 shares of HD validated and approved Order to BUY 885 shares of VZ validated and approved Order to BUY 1276 shares of XOM validated and approved Order to BUY 1112 shares of CMCSA validated and approved Order to BUY 533 shares of CVX validated and approved Order to BUY 210 shares of BA validated and approved Order to BUY 91 shares of COST validated and approved Trades Executed Order to SELL 3992 shares of AAPL validated and approved Order to SELL 94 shares of AMZN validated and approved Order to SELL 68 shares of GOOGL validated and approved Order to SELL 551 shares of WMT validated and approved Order to SELL 207 shares of UNH validated and approved Order to SELL 1890 shares of T validated and approved Order to SELL 258 shares of HD validated and approved Order to SELL 885 shares of VZ validated and approved Order to SELL 1276 shares of XOM 
validated and approved Order to SELL 1112 shares of CMCSA validated and approved Order to SELL 533 shares of CVX validated and approved Order to SELL 210 shares of BA validated and approved Order to SELL 91 shares of COST validated and approved Trades Executed Order to BUY 2186 shares of MSFT validated and approved Order to BUY 694 shares of FB validated and approved Order to BUY 90 shares of GOOGL validated and approved Order to BUY 514 shares of V validated and approved Order to BUY 297 shares of UNH validated and approved Order to BUY 1399 shares of INTC validated and approved Order to BUY 325 shares of HD validated and approved Order to BUY 637 shares of CVX validated and approved Order to BUY 241 shares of MCD validated and approved Order to BUY 146 shares of COST validated and approved Trades Executed Order to SELL 2186 shares of MSFT validated and approved Order to SELL 694 shares of FB validated and approved Order to SELL 90 shares of GOOGL validated and approved Order to SELL 514 shares of V validated and approved Order to SELL 297 shares of UNH validated and approved Order to SELL 1399 shares of INTC validated and approved Order to SELL 325 shares of HD validated and approved Order to SELL 637 shares of CVX validated and approved Order to SELL 241 shares of MCD validated and approved Order to SELL 146 shares of COST validated and approved Trades Executed Order to BUY 3023 shares of AAPL validated and approved Order to BUY 1404 shares of MSFT validated and approved Order to BUY 92 shares of AMZN validated and approved Order to BUY 505 shares of WMT validated and approved Order to BUY 141 shares of TSLA validated and approved Order to BUY 176 shares of MA validated and approved Order to BUY 199 shares of HD validated and approved Order to BUY 99 shares of NVDA validated and approved Order to BUY 147 shares of CRM validated and approved Trades Executed Order to SELL 3023 shares of AAPL validated and approved Order to SELL 1404 shares of MSFT validated and 
approved Order to SELL 92 shares of AMZN validated and approved Order to SELL 505 shares of WMT validated and approved Order to SELL 141 shares of TSLA validated and approved Order to SELL 176 shares of MA validated and approved Order to SELL 199 shares of HD validated and approved Order to SELL 99 shares of NVDA validated and approved Order to SELL 147 shares of CRM validated and approved Trades Executed Order to BUY 2973 shares of AAPL validated and approved Order to BUY 1406 shares of MSFT validated and approved Order to BUY 459 shares of FB validated and approved Order to BUY 55 shares of GOOGL validated and approved Order to BUY 129 shares of TSLA validated and approved Order to BUY 300 shares of V validated and approved Order to BUY 117 shares of NVDA validated and approved Order to BUY 175 shares of MA validated and approved Order to BUY 197 shares of PYPL validated and approved Order to BUY 88 shares of ADBE validated and approved Order to BUY 79 shares of NFLX validated and approved Order to BUY 188 shares of CRM validated and approved Trades Executed Order to SELL 2973 shares of AAPL validated and approved Order to SELL 1406 shares of MSFT validated and approved Order to SELL 459 shares of FB validated and approved Order to SELL 55 shares of GOOGL validated and approved Order to SELL 129 shares of TSLA validated and approved Order to SELL 300 shares of V validated and approved Order to SELL 117 shares of NVDA validated and approved Order to SELL 175 shares of MA validated and approved Order to SELL 197 shares of PYPL validated and approved Order to SELL 88 shares of ADBE validated and approved Order to SELL 79 shares of NFLX validated and approved Order to SELL 188 shares of CRM validated and approved Trades Executed Order to BUY 2467 shares of JPM validated and approved Order to BUY 6505 shares of BAC validated and approved Order to BUY 1535 shares of CVX validated and approved Order to BUY 3171 shares of WFC validated and approved Order to BUY 1540 
shares of MS validated and approved Order to BUY 1636 shares of C validated and approved Order to BUY 432 shares of BA validated and approved Order to BUY 279 shares of GS validated and approved Order to BUY 1159 shares of USB validated and approved Trades Executed
### updates the portfolio's performance between the last trading day and the end of the time horizon
test.update_portfolio(date =dates[-1] , cash_add = None)
### Snapshots offer a view of a portfolio at a point in time, as it stood/stands
snap_port = test.snapshots['Positions_2019-12-31']
snap_cash = test.snapshots['cash_2019-12-31']
### the track_record attribute is a record of a portfolio's performance over time
tr = test.track_record
Portfolios are usually evaluated in both risk-adjusted and absolute terms. Evaluating a portfolio in absolute terms is often as simple as looking at its performance over time and comparing it to a peer or benchmark, whereas evaluating a portfolio in risk-adjusted terms usually requires that the analyst "level the playing field" somehow (math is usually involved).
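As one example of such leveling, the Sharpe ratio divides average return by volatility. A minimal sketch, assuming daily returns and a zero risk-free rate for simplicity:

```python
import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio from daily returns (risk-free rate assumed zero)."""
    daily_returns = np.asarray(daily_returns)
    return np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std(ddof=1)

# One simulated year of daily returns, purely for illustration
rng = np.random.default_rng(1)
rets = rng.normal(0.0005, 0.01, size=252)
print(round(sharpe_ratio(rets), 2))
```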
In absolute terms, the portfolio's performance is respectable for the most part. We see periods of underperformance initially, but the model did a good job protecting the downside during the start of the COVID pandemic. In fact, at the market's trough on March 23, 2020, the market was down 24% from the start of the evaluation period, while the model was down only 15%. Since that time, the model has mostly outperformed in absolute terms. The model did have some difficulty with the regime change as the market rotated from growth to value in late January 2021, but recovered nicely after the next trading date.
perf = pch.performance_chart(tr, 'SPY')
Image(pio.to_image(perf, format='png'))
We use five major risk-adjusted metrics to evaluate our portfolio.
The risk-adjusted metrics paint a much brighter picture for the model: all of them show the model outperforming the market comfortably.
ram = pch.risk_adjusted_metrics(tr,'SPY')
Image(pio.to_image(ram, format='png'))
The capital asset pricing model, or CAPM, is one of the most widely used models in finance. It is used to evaluate risk, calculate expected returns, and measure portfolio outperformance. It is interpreted as follows:
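Concretely, CAPM alpha and beta can be estimated by regressing portfolio returns on market returns, r_p = alpha + beta * r_m + eps. A sketch on simulated data (all numbers below are illustrative, not our portfolio's):

```python
import numpy as np

# Simulated daily returns: a market series and a portfolio with known alpha/beta
rng = np.random.default_rng(0)
r_m = rng.normal(0.0004, 0.01, size=500)               # market (benchmark) returns
r_p = 0.0002 + 1.2 * r_m + rng.normal(0, 0.002, 500)   # portfolio returns

# Ordinary least squares via a degree-1 polynomial fit: slope = beta, intercept = alpha
beta, alpha = np.polyfit(r_m, r_p, 1)
print(f"alpha={alpha:.5f}, beta={beta:.2f}")  # a positive alpha suggests outperformance
```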
capm = pch.capm_res(tr, "SPY")
Image(pio.to_image(capm, format='png'))